Abstract

We introduce a novel robust hybrid 3D face tracking framework from RGBD
video streams, which is capable of tracking head pose and facial actions
without pre-calibration or intervention from a user. In particular, we
emphasize improving the tracking performance in instances where the
tracked subject is at a large distance from the cameras and the quality
of the point cloud deteriorates severely. This is accomplished by
combining a flexible 3D shape regressor with a joint 2D+3D optimization
on shape parameters. Our approach fits facial blendshapes to the point
cloud of the human head, while being driven by an efficient and rapid 3D
shape regressor trained on generic RGB datasets. As an on-line tracking
system, the identity of the unknown user is adapted on-the-fly, resulting
in improved 3D model reconstruction and consequently better tracking
performance. The result is a robust RGBD face tracker capable of handling
a wide range of target scene depths, beyond those that can be afforded by
traditional depth or RGB face trackers. Lastly, since the blendshape is
not able to accurately recover the real facial shape, we use the tracked
3D face model as a prior in a novel filtering process to further refine
the depth map for use in other tasks, such as 3D reconstruction.

1. Introduction 

Tracking dynamic expressions of human faces is an important task, with
recent methods [26, ?] achieving impressive results. However, difficult
problems remain due to variations in camera pose, video quality, head
movement and illumination, added to the challenge of tracking different
people with many unique facial expressions.



Figure 1. A tracking result of our proposed method. (a) The 3D landmarks
projected to the color frame. (b) The 3D blendshape. (c) The 3D frontal
view, with the blendshape model in red and the input point cloud in
white. (d) The 3D side view.

Early work on articulated face tracking was based on Active Shape and
Appearance Models [12, ?, 20] that fit a parametric facial template to
the image. The facial template is learned from data, and thus the
tracking quality is limited by the amount of training samples and the
optimization method. Recently, alternative regression-based methods [?]
have resulted in better performance due to greater flexibility and
computational efficiency.

Another common approach for face tracking is to use 3D deformable models
as priors [?]. In general, a 3D face model is controlled by a set of
shape deformation units. In the past, generic wireframe models (WFM) were
often employed for simplicity. However, a WFM can only represent a coarse
face shape and is insufficient for fine-grained face tracking when dense
3D data is available.

Blendshape-based face models, such as the shape tensor used in
FaceWarehouse [7], were developed for more sophisticated, accurate 3D
face tracking. By deforming dense 3D blendshapes to fit facial
appearances, facial motions can be estimated with high fidelity. Such
techniques have gained attention recently due to the proliferation of
consumer-grade range sensing devices, such as the Microsoft Kinect [?],
which provide synchronized color (RGB) images and depth (D) maps in real
time. By integrating blendshapes into dynamic expression models (DEM),
several approaches [?] have demonstrated state-of-the-art tracking
performance on RGBD input. All of these tracking frameworks rely heavily
on the quality of the input depth data. However, existing consumer-grade
depth sensors tend to provide increasingly unreliable depth measurements
as objects move farther away [22]. Therefore, these methods [28, 3, 18]
only work well at close range, where the depth map retains fine
structural details of the face. In many applications, such as room-sized
teleconferencing, the individuals tracked may be located at considerable
distances from the camera, leading to poor performance with existing
methods.

One way of addressing depth sensor limitations is to use color, as in
[?]. These RGB-based methods require extensive training to learn a 3D
shape regressor. The learned regressor serves as a prior for DEM
registration to RGB frames. Despite the high training cost, these methods
have tracking results comparable to RGBD-based approaches. Although
RGB-only methods are not affected by inaccurate depth measurements, it is
still challenging to track with high fidelity at large object-camera
distances. This is in part due to the reduced reliability of
regression-based updates at lower image resolutions, when there is less
data for overcoming depth ambiguity. Instead, we expect to achieve better
tracking results by incorporating depth data while intelligently handling
its inaccuracies at greater distances.

This motivates us to propose a robust RGBD face tracker 
combining the advantages of RGB regression and 3D point 
cloud registration. Our contributions are as follows: 

• Our tracker is guided by a multi-stage 3D shape regressor based on
random forests and linear regression, which maps 2D image features back
to blendshape parameters for a 3D face model. This 3D shape regressor
bypasses the problem of noisy depth data when obtaining a good initial
estimate of the blendshape.

• The subsequent joint 2D+3D optimization matches the facial blendshape
to both image and depth data robustly. This approach does not require an
a priori blendshape model of the user, as shape parameters are updated
on-the-fly.

• Extensive experiments show that our 3D tracker performs robustly across
a wide range of scenes and visual conditions, while maintaining or
surpassing the tracking performance of other state-of-the-art trackers.

• We use the DEM blendshape as a prior in a depth filtering process,
further improving the depth map for fine 3D reconstruction.

The rest of this paper is organized as follows. Section 2 outlines our
proposed 3D face tracking framework. Section 3 describes the 3D shape
regression in detail. DEM registration is further elaborated in
Section 4. Section 5 describes our depth recovery method using a 3D face
prior. Section 6 presents the experimental tracking and depth recovery
results.

2. System Overview 

In this section we present the blendshape model that we 
use in this work, and our proposed tracking framework. 

2.1. The Face Representation 

We use the face models developed in the FaceWarehouse database [7]. As
specified in [7], a facial expression of a person can be approximated by

$$V = C_r \times_2 w_{id}^T \times_3 w_{exp}^T \qquad (1)$$

where $C_r$ is a 3D matrix (called the reduced core tensor) of size
$(N_v, N_{id}, N_e)$ (corresponding to the number of vertices, number of
identities and number of expressions, respectively), $w_{id}$ is an
$N_{id}$-dimensional identity vector, and $w_{exp}$ is an
$N_e$-dimensional expression vector. Equation (1) describes tensor
contraction at the 2nd mode by $w_{id}$ and at the 3rd mode by $w_{exp}$.

Similar to [?], for real-time face tracking of one person, given the
identity vector $w_{id}$, it is more convenient to reconstruct the $N_e$
expression blendshapes for the person of identity $w_{id}$ as

$$B_j = C_r \times_2 w_{id}^T \times_3 u_{exp_j}^T \qquad (2)$$

where $u_{exp_j}$ is the pre-computed weight vector for the $j$-th
expression mode [?]. In this way, an arbitrary facial shape of the person
can be represented as a linear sum of the expression blendshapes:

$$V = B_0 + \sum_{j=1}^{N_e - 1} (B_j - B_0)\, e_j \qquad (3)$$

where $B_0$ is the neutral shape, and $e_j \in [0, 1]$ is the blending
weight, $j = 1, \dots, N_e - 1$. Finally, a fully transformed 3D facial
shape can be represented as

$$S = R \cdot V(B, e) + T \qquad (4)$$

with the parameters $\theta = (R, T, e)$, where $R$ and $T$ respectively
represent global rotation and translation, and $e = \{e_j\}$ defined in
(3) represents the expression deformation parameters. In this work, we
keep the 50 most significant identity knobs in the reduced core tensor
$C_r$, hence $(N_v, N_{id}, N_e) = (11510, 50, 47)$.
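
The chain from (1) to (4) can be illustrated on toy data. The following is a minimal sketch, assuming the core tensor is stored as a nested list indexed [vertex coordinate][identity][expression]; the helper names (`contract`, `blend`, `rigid_transform`), the one-hot expression vectors and all numbers are hypothetical and are not taken from FaceWarehouse.

```python
def contract(core, w_id, w_exp):
    """Mode-2/mode-3 contraction of a (N_v, N_id, N_e) tensor, as in (1)."""
    return [
        sum(c * w_id[i] * w_exp[j]
            for i, row in enumerate(slab)
            for j, c in enumerate(row))
        for slab in core
    ]


def blend(blendshapes, e):
    """V = B_0 + sum_j (B_j - B_0) * e_j, as in (3); each B_j is a flat list."""
    neutral = blendshapes[0]
    shape = list(neutral)
    for B_j, e_j in zip(blendshapes[1:], e):
        for n, (b, b0) in enumerate(zip(B_j, neutral)):
            shape[n] += (b - b0) * e_j
    return shape


def rigid_transform(R, T, flat_shape):
    """S = R V + T, as in (4), applied to consecutive (x, y, z) triples."""
    out = []
    for n in range(0, len(flat_shape), 3):
        v = flat_shape[n:n + 3]
        out.extend(sum(R[r][c] * v[c] for c in range(3)) + T[r] for r in range(3))
    return out


# Toy core tensor: one vertex (3 coordinates), 2 identities, 2 expressions.
core = [
    [[1.0, 1.0], [3.0, 3.0]],   # x
    [[0.0, 2.0], [0.0, 4.0]],   # y
    [[5.0, 5.0], [5.0, 5.0]],   # z
]
w_id = [0.5, 0.5]
# One-hot expression weights stand in for the pre-computed vectors of (2).
B0 = contract(core, w_id, [1.0, 0.0])
B1 = contract(core, w_id, [0.0, 1.0])
V = blend([B0, B1], e=[0.5])
identity_rotation = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
S = rigid_transform(identity_rotation, [0.0, 0.0, 0.7], V)
```

Precomputing the blendshapes once per identity, as in (2), means that each frame only needs the cheap linear blend of (3) rather than a full tensor contraction.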



Figure 2. The pipeline of the proposed method. 


2.2. Framework Pipeline 

Fig. 2 shows the pipeline of the proposed face tracking framework, which
follows a coarse-to-fine multi-stage optimization design. In particular,
our framework consists of two major stages: shape regression and shape
refinement. The shape regressor, which is learned from training data,
performs the first optimization stage to quickly estimate the shape
parameters $\theta$ from the RGB frame (cf. Section 3). Then, in the
second stage, a carefully designed optimization is performed on both the
2D image and the available 3D point cloud data to refine the shape
parameters, and finally the identity parameter $w_{id}$ is updated to
improve shape fitting to the input RGBD data (cf. Section 4).

The 3D shape regressor is the key component in achieving our goal of 3D
tracking at large distances, where the quality of the depth map is often
poor. Existing RGBD-based face tracking works either rely heavily on an
accurate input point cloud (at close distances) to model shape
transformation by ICP [28, ?] or use an off-the-shelf 2D face tracker to
guide the shape transformation [?]. In contrast, we predict the 3D shape
parameters directly from the RGB frame using the developed 3D regressor.
This is motivated by the success of 3D shape regression from RGB images
in [?]. The approach is especially meaningful for the large-distance
scenarios we consider, where the depth quality is poor. Thus, we do not
use depth information in the 3D shape regression, to avoid propagating
inaccuracies from the depth map.

Initially, a color frame $I$ is passed through the regressor to recover
the shape parameters $\theta$. The projection of the $N_l$ ($N_l = 73$)
landmark vertices of the 3D shape onto the image plane typically does not
accurately match the 2D landmarks annotated in the training data. We
therefore include 2D displacements $D$, as in [?], in the parameter set
and define a new global shape parameter set
$P = (\theta, D) = (R, T, e, D)$. The advantages of including $D$ in $P$
are two-fold. First, it helps train the regressor to reproduce landmarks
in the test image similar to those in the training set. Second, it
prepares the regressor to work with an unseen identity that does not
appear in the training set [?]. In such a case the displacement error $D$
may be large, to compensate for the difference in identities. The
regression process can be expressed as $P^{out} = f_r(I, P^{in})$, where
$f_r$ is the regression function, $I$ is the current frame, and $P^{in}$
and $P^{out}$ are the input (from the shape regression for the previous
frame) and output shape parameter sets, respectively. The coarse
estimates $P^{out}$ are refined further in the next stage, using a more
precise energy optimization that adds depth information. Specifically,
$\theta = (R, T, e)$ is optimized w.r.t. both the 2D prior constraints
provided by the 2D landmarks estimated by the shape regressor and the 3D
point cloud. Lastly, the identity vector $w_{id}$ is re-estimated given
the current transformation.

3. 3D Shape Regression 

As mentioned in Section 2.2, the shape regressor regresses over the
parameter vector $P = (R, T, e, D)$. To train the regressor, we must
first recover these parameters from training samples and form training
data pairs to provide to the training algorithm. In this work, we use the
face databases from [?] for training.

3.1. Shape Parameters Estimation from Training Data

We follow the parameter estimation process in [?]. Denoting by $\Pi_p$
the camera projection function from 3D world coordinates to 2D image
coordinates, $(R, T, w_{id}, w_{exp})$ are first extracted by minimizing
the 2D errors in each sample:

$$\min_{R, T, w_{id}, w_{exp}} \sum_{i=1}^{N_l} \left\| \Pi_p\left( R \left( C_r \times_2 w_{id}^T \times_3 w_{exp}^T \right)_i + T \right) - l_i \right\|^2 \qquad (5)$$

where $\{l_i \mid i = 1, \dots, N_l\}$ are the ground truth landmarks of
the training data, $N_l = 73$, and $(\cdot)_i$ selects the vertex
corresponding to landmark $i$. Note that $w_{exp}$ will be discarded,
since we only need $w_{id}$ to generate the individual expression
blendshapes $\{B_j\}$ of the current subject as in (2) for later
optimization over $(R, T, e)$.

With the parameters initially extracted in (5), we refine them by
alternately optimizing over $w_{id}$ and $(R, T, w_{exp})$. In
particular, we first keep $(R, T, w_{exp})$ fixed for each sample, and
optimize over $w_{id}$ across all the samples of the same subject:

$$\min_{w_{id}} \sum_{k=1}^{N_s} \sum_{i=1}^{N_l} \left\| \Pi_p\left( R_k \left( C_r \times_2 w_{id}^T \times_3 w_{k,exp}^T \right)_i + T_k \right) - l_{k,i} \right\|^2 \qquad (6)$$

where $N_s$ denotes the total number of training samples for the same
subject. Then for each sample we keep $w_{id}$ fixed and optimize over
$(R, T, w_{exp})$ as in (5). This process is repeated until convergence.
We empirically observe that running the above process for three
iterations gives reasonably good results. We can then generate the
user-specific blendshapes $\{B_j\}$ as in (2).

Finally, we recover the expression weights $e$ by minimizing the 2D error
over $(R, T, e)$ again:

$$\min_{R, T, e} \sum_{i=1}^{N_l} \| D_i \|^2 \qquad (7)$$

where $D_i = \Pi_p(S_i) - l_i$ and $S_i$ is the 3D landmark vertex of the
blendshape corresponding to $l_i$. From (7), we also obtain the 2D
displacement vector $D = \{D_i\}$ as a by-product. Eventually, following
[?], for each training data sample we generate a number of guess-truth
pairs $\{I_k, P_k^0, P_k^g\}$, where the guessed vector $P_k^0$ is
produced by randomly perturbing the ground truth parameters $P_k^g$
extracted through the above optimization. In this way, we create $N$
training pairs in total.
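
The generation of guess-truth pairs can be sketched as follows. This is only an illustration under stated assumptions: the perturbation here is independent Gaussian noise with a single standard deviation `sigma`, the parameter vector is flattened into a plain list, and the function name `make_guess_truth_pairs` is hypothetical; the text above only says that the ground truth is randomly perturbed.

```python
import random


def make_guess_truth_pairs(image_id, p_truth, n_guesses, sigma, rng):
    """Return (image, guessed parameters, ground-truth parameters) triples."""
    pairs = []
    for _ in range(n_guesses):
        # Assumption: each parameter is perturbed independently.
        guess = [p + rng.gauss(0.0, sigma) for p in p_truth]
        pairs.append((image_id, guess, list(p_truth)))
    return pairs


rng = random.Random(0)                    # fixed seed for reproducibility
p_truth = [0.1, -0.2, 0.7, 0.0, 0.3]      # hypothetical flattened (R, T, e, D)
pairs = make_guess_truth_pairs("frame_000", p_truth, n_guesses=4,
                               sigma=0.05, rng=rng)
```

In practice the rotation, translation, expression and displacement components would likely need different perturbation scales, since they live in different units.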


3.2. Shape Regression Training 

Given the training pairs from the previous section, we follow the feature
extraction and shape regression method in [24], which combines local
binary features extracted using the trained random forests of all the
landmarks. The local binary features are aggregated into a global feature
vector, which is then used to train a linear regression model to predict
the shape parameters. In our work, we train the regressor to predict
$(R, T, e, D)$ simultaneously and directly from the input RGB frame, in
contrast to [24], where the regressor only updates the 2D displacements.

Algorithm 1 shows the detailed training procedure. In particular, we
calculate the 2D landmark positions from the shape parameters, and for
each landmark $l_i$, we randomly sample pixel intensity-difference
features within a radius $r_i$. These pixel-difference features are then
used to train a random forest $Forest_i$. We pass every training sample
$M_k$ through the forest and recover a binary vector $F_{k,i}$ whose
length equals the number of leaf nodes of the forest. Each node that
responds to the sample is represented as 1 in $F_{k,i}$; otherwise it is
0. The local binary vectors from the $N_l$ landmarks are concatenated to
form a global binary vector $\Phi_k$ representing the training sample
$k$. Then, the global binary feature vectors are used to learn a global
linear regression matrix $W$, which predicts the shape parameter updates
$\Delta P$ from those global binary vectors. After that, the guessed
shape parameters are updated and enter the next iteration.
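
The binary encoding described above can be sketched as follows. This is a minimal illustration, assuming each tree is a nested dictionary whose internal nodes compare the intensity difference of two pixels (given as offsets from the landmark) against a threshold; the dictionary layout and the function names are hypothetical, and forest training itself is not shown.

```python
def leaf_index(tree, image, landmark):
    """Descend one tree using pixel intensity differences around a landmark."""
    x, y = landmark
    node = tree
    while "leaf" not in node:
        (ax, ay), (bx, by) = node["a"], node["b"]
        diff = image[y + ay][x + ax] - image[y + by][x + bx]
        node = node["left"] if diff < node["thr"] else node["right"]
    return node["leaf"]


def local_binary_feature(forest, leaf_counts, image, landmark):
    """One-hot encode, for every tree, the leaf that the sample reaches."""
    feature = []
    for tree, count in zip(forest, leaf_counts):
        one_hot = [0] * count
        one_hot[leaf_index(tree, image, landmark)] = 1
        feature.extend(one_hot)
    return feature


def global_binary_feature(forests, leaf_counts, image, landmarks):
    """Concatenate the local binary features of all landmarks."""
    feature = []
    for forest, counts, landmark in zip(forests, leaf_counts, landmarks):
        feature.extend(local_binary_feature(forest, counts, image, landmark))
    return feature


# Toy 3x3 grayscale image and a single one-split tree for a single landmark.
image = [
    [0, 0, 0],
    [10, 0, 50],
    [0, 0, 0],
]
tree = {"a": (1, 0), "b": (-1, 0), "thr": 0,
        "left": {"leaf": 0}, "right": {"leaf": 1}}
phi = global_binary_feature([[tree]], [[2]], image, landmarks=[(1, 1)])
```

Because exactly one leaf per tree is active, the global vector is very sparse, which keeps the subsequent linear regression cheap to evaluate.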


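Algorithm 1: The regressor training algorithm

Data: $N$ training samples $\{I_k, P_k^0, P_k^g\}$
Result: The shape regressor

for $t \leftarrow 1$ to $N_t$ do
    for $i \leftarrow 1$ to $N_l$ do
        $Forest_i \leftarrow$ TrainForest($l_i$);
        for $k \leftarrow 1$ to $N$ do
            $F_{k,i} \leftarrow$ Pass($M_k$, $Forest_i$);
        end
    end
    for $k \leftarrow 1$ to $N$ do
        $\Phi^t(I_k, P_k^{t-1}) \leftarrow$ concat($F_{k,i}$);
    end
    $\min_{W^t} \sum_{k=1}^{N} \| \Delta P_k^t - W^t \Phi^t(I_k, P_k^{t-1}) \|^2$;
    for $k \leftarrow 1$ to $N$ do
        $P_k^t \leftarrow P_k^{t-1} + W^t \Phi^t(I_k, P_k^{t-1})$;
    end
end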



Similar to [24], we let the regressor learn the best search radius during
training. The training face samples have been normalized to a size of
approximately 120x120 pixels, about the same size as a face captured by
Kinect at a distance of 0.7 m. Thus at runtime, we simply rescale the
radius in inverse proportion to the current z-translation $T_z$.
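
The runtime rescaling amounts to one line of arithmetic. The sketch below assumes $T_z$ is expressed in meters and that the learned radius is given in pixels at the 0.7 m training scale; the function name is hypothetical.

```python
TRAIN_DISTANCE_M = 0.7  # distance at which a face spans roughly 120x120 pixels


def runtime_radius(trained_radius_px, t_z_m):
    """Scale the learned search radius inversely with the face distance."""
    return trained_radius_px * TRAIN_DISTANCE_M / t_z_m


radius_at_training_scale = runtime_radius(20.0, 0.7)
radius_at_double_distance = runtime_radius(20.0, 1.4)
```

Doubling the distance halves the apparent face size, so the search radius is halved as well.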

4. 3D Shape Refinement 

At this stage, we refine the shape parameters using both RGB and depth
images, and also update the shape identity. Specifically, $(R, T)$ and
$e$ are alternately optimized. After convergence, the identity vector
$w_{id}$ is updated based on the final shape parameter vector
$\theta = (R, T, e)$.

4.1. Facial Shape Expressions and Global Transformation

We simultaneously refine $(R, T, e)$ by optimizing the following energy:

$$(R, T, e) = \arg\min_{R, T, e} \; E_{2D} + \omega_{3D} E_{3D} + E_{reg} \qquad (8)$$

where $\omega_{3D}$ is a tradeoff parameter, $E_{2D}$ is the 2D error
term measuring the 2D displacement errors, $E_{3D}$ is the 3D ICP energy
term measuring the geometric match between the 3D face shape model and
the input point cloud, and $E_{reg}$ is the regularization term that
keeps the shape parameter refinement smooth over time. In particular,
$E_{2D}$, $E_{3D}$ and $E_{reg}$ are defined as

$$E_{2D} = \frac{1}{N_l} \sum_{i=1}^{N_l} \left\| \Pi_p(S_i(R, T, e)) - l_i \right\|^2 \qquad (9)$$

$$E_{3D} = \frac{1}{N_d} \sum_{k=1}^{N_d} \left( (S_k(R, T, e) - d_k) \cdot n_k \right)^2 \qquad (10)$$

$$E_{reg} = \alpha \left\| \theta - \theta^* \right\|^2 + \beta \left\| \theta - 2\theta^{(t-1)} + \theta^{(t-2)} \right\|^2 \qquad (11)$$

In (9), the tracked 2D landmarks $\{l_i\}$ are computed from the raw
shape parameters $(R, T, e, D)$, which are usually quite reliable. In
(10), $N_d$ is the number of ICP corresponding pairs that we sample from
the blendshape and the point cloud, and $d_k$ and $n_k$ denote point $k$
in the point cloud and its normal, respectively. By minimizing $E_{3D}$,
we essentially minimize the point-to-plane ICP distance between the
blendshape and the point cloud [?]. This helps slide the blendshape over
the point cloud to avoid local minima and recover a more accurate pose.
In (11), $\theta^*$ is the raw output $(R, T, e)$ from the shape
regressor, $\theta^{(t-1)}$ and $\theta^{(t-2)}$ are the shape parameters
from the previous two frames, and $\alpha$ and $\beta$ are tradeoff
parameters. The two terms in (11) represent a data fidelity term and a
Laplacian smoothness term, respectively.
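
To make the two non-image terms concrete, the sketch below evaluates a point-to-plane energy and the temporal regularizer on toy inputs. It assumes that correspondences are already established, that the 3D energy is averaged over the pairs, and that the parameters are flattened into plain lists; the function names are hypothetical.

```python
def point_to_plane_energy(model_pts, cloud_pts, normals):
    """Mean squared distance from model points to the tangent planes at d_k."""
    total = 0.0
    for s, d, n in zip(model_pts, cloud_pts, normals):
        signed = sum((s[c] - d[c]) * n[c] for c in range(3))
        total += signed ** 2
    return total / len(model_pts)


def regularizer(theta, theta_star, theta_prev1, theta_prev2, alpha, beta):
    """alpha * fidelity to the regressor output + beta * second difference."""
    fidelity = sum((t - r) ** 2 for t, r in zip(theta, theta_star))
    smooth = sum((t - 2.0 * p1 + p2) ** 2
                 for t, p1, p2 in zip(theta, theta_prev1, theta_prev2))
    return alpha * fidelity + beta * smooth


# A model point 1 unit above the plane z = 0, far away along the plane.
e3d = point_to_plane_energy([(0.0, 0.0, 1.0)], [(5.0, 0.0, 0.0)],
                            [(0.0, 0.0, 1.0)])
# Constant-velocity motion (1 -> 2 -> 3) has zero second difference.
ereg = regularizer([3.0], [1.0], [2.0], [1.0], alpha=0.1, beta=10.0)
```

The second-difference term penalizes acceleration rather than motion, so a head that keeps moving at constant speed is not pulled back toward the previous frame.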

In our implementation, we iteratively optimize over the global
transformation parameters $(R, T)$ and the local deformation parameter
$e$, which leads to faster convergence and lower computational cost. In
the $(R, T)$ optimization, $\omega_{3D}$ is set to 2; $\alpha$ and
$\beta$ are set to 100 and 10000 for $R$, and to 0.1 and 10 for $T$,
respectively. For the optimization over $e$, $\omega_{3D}$ is set to 0.5;
$\alpha$ and $\beta$ are both set to zero so as to maximize spontaneous
local deformations. The non-linear energy function is minimized using the
ALGLIB::BLEIC bounded solver¹, which keeps $e$ in the valid range of
$[0, 1]$.
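
The effect of the box constraint on $e$ can be illustrated with a much simpler method. The sketch below uses plain projected gradient descent on a toy quadratic energy; it is swapped in purely for illustration and is neither the BLEIC algorithm nor the ALGLIB API. The function names, step size and toy target are all assumptions.

```python
def clamp01(x):
    """Project a scalar onto the interval [0, 1]."""
    return min(1.0, max(0.0, x))


def minimize_box(energy_grad, e0, step=0.1, iters=200):
    """Gradient step followed by projection onto [0, 1]^n, repeated."""
    e = list(e0)
    for _ in range(iters):
        grad = energy_grad(e)
        e = [clamp01(e_j - step * g_j) for e_j, g_j in zip(e, grad)]
    return e


# Toy energy sum_j (e_j - target_j)^2 whose free minimum leaves the box.
target = [0.3, 1.4, -0.2]


def toy_grad(e):
    return [2.0 * (e_j - t_j) for e_j, t_j in zip(e, target)]


e_opt = minimize_box(toy_grad, [0.5, 0.5, 0.5])
```

Weights whose unconstrained optimum lies outside the valid range end up exactly on the boundary, while the others converge to their free minimum.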

Fig. 3 gives an example of the effect of the $E_{3D}$ term. In the result
without $E_{3D}$, there is a large displacement between the point cloud
and the model, and there is also noticeable over-deformation of the
mouth. This demonstrates that without the 3D information, the 2D tracking
may appear fine while the actual 3D transformation is largely incorrect.

4.2. Updating Shape Identity 

In the last step, we refine the identity vector to better adapt the
expression blendshapes to the input data.

¹ http://www.alglib.net/

Figure 3. The effect of the $E_{3D}$ term. (a, b): The result without
using $E_{3D}$. (c, d): The result using $E_{3D}$. Notice the
displacement between the point cloud and the model, as well as the
over-deformation of the mouth in (b).

We solve for $w_{id}$ by minimizing the following objective function:


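$$w_{id} = \arg\min_{w_{id}} \; E'_{2D} + \omega_{3D} E'_{3D} \qquad (12)$$

where

$$E'_{2D} = \frac{1}{N_l} \sum_{i=1}^{N_l} \left\| \Pi_p\left( R \left( C_r \times_2 w_{id}^T \times_3 \gamma^T \right)_i + T \right) - l_i \right\|^2$$

$$E'_{3D} = \frac{1}{N_d} \sum_{k=1}^{N_d} \left\| R \left( C_r \times_2 w_{id}^T \times_3 \gamma^T \right)_k + T - d_k \right\|^2 \qquad (13)$$

with $\gamma = \left(1 - \sum_{j=1}^{N_e - 1} e_j\right) u_{exp_0} + \sum_{j=1}^{N_e - 1} e_j u_{exp_j}$.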

Note that $E'_{3D}$ is the point-to-point ICP energy, and it behaves
slightly differently from $E_{3D}$ in (10). Minimizing $E'_{3D}$ helps
align the blendshape to the point cloud more directly on the surface, to
recover detailed facial characteristics.
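
The difference between the two ICP residuals shows up when a model point is displaced along the surface. A small sketch, with hypothetical function names and a single hand-picked correspondence:

```python
def point_to_point_sq(s, d):
    """Squared Euclidean distance between a model point and a cloud point."""
    return sum((s[c] - d[c]) ** 2 for c in range(3))


def point_to_plane_sq(s, d, n):
    """Squared distance from s to the plane through d with unit normal n."""
    return sum((s[c] - d[c]) * n[c] for c in range(3)) ** 2


d = (0.0, 0.0, 1.0)            # cloud point
n = (0.0, 0.0, 1.0)            # its unit normal
s_tangential = (0.5, 0.0, 1.0)  # model point shifted along the surface

plane_cost = point_to_plane_sq(s_tangential, d, n)
point_cost = point_to_point_sq(s_tangential, d)
```

The point-to-plane cost ignores the tangential shift, which lets the model slide during pose estimation, whereas the point-to-point cost penalizes it, which pins surface detail in place during identity fitting.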

In our experiments, we empirically set $\omega_{3D}$ to 0.5, meaning that
we give more weight to the 2D term to encourage the face model to fit
closer to the tracked landmarks, especially the face contour.
Gradient-based optimizations such as BFGS are ineffective for this
energy, and thus we run one iteration of coordinate descent at each frame
to stay within the computational budget. We find that $w_{id}$ usually
converges in under 10 frames after tracking starts. To save computational
time, we set a simple rule in which identity updating stops either after
$w_{id}$ converges or after 10 frames.
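
The per-frame update and the stopping rule can be sketched as follows. The coordinate update here is a simple derivative-free search that tries a fixed step in each direction; this is an assumption made for illustration, since the exact coordinate update is not specified above. The function names, step size and toy energy are likewise hypothetical.

```python
def coordinate_sweep(energy, w, step):
    """One pass over all coordinates, keeping any +/- step that lowers energy."""
    w = list(w)
    best = energy(w)
    for i in range(len(w)):
        for delta in (step, -step):
            trial = w[:i] + [w[i] + delta] + w[i + 1:]
            value = energy(trial)
            if value < best:
                w, best = trial, value
                break
    return w, best


def adapt_identity(energy_per_frame, w0, step=0.05, tol=1e-9, max_frames=10):
    """One sweep per frame; stop on convergence or after max_frames frames."""
    w = list(w0)
    frames_used = 0
    for energy in energy_per_frame:
        if frames_used >= max_frames:
            break
        new_w, _ = coordinate_sweep(energy, w, step)
        frames_used += 1
        converged = max(abs(a - b) for a, b in zip(new_w, w)) < tol
        w = new_w
        if converged:
            break
    return w, frames_used


def toy_energy(w):
    return (w[0] - 0.1) ** 2 + (w[1] + 0.1) ** 2


w_id, frames_used = adapt_identity([toy_energy] * 20, [0.0, 0.0])
```

Spreading the sweeps over consecutive frames keeps the per-frame cost bounded, at the price of the identity lagging behind for the first few frames.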

Fig. 4 shows some results of adapting the identity parameter over time.
After a few iterations of updating $w_{id}$, the face model fits each
individual subject significantly better.


5. Depth Recovery with Dense Face Priors 

In this section, we develop one further application to show the
usefulness of the final blendshape model for each frame, i.e., using the
dense blendshape model as the prior for depth recovery. Although the
final blendshape itself is a good approximation to the real face and
sufficiently good for tracking purposes, it might not be sufficient for
other applications such as 3D face reconstruction. Thus, it is meaningful
to use face priors to refine noisy depth maps. Existing methods for depth
recovery [21, 30, 33, ?] usually utilize general prior information such
as piece-wise smoothness and the corresponding color guidance, and thus
they tend to produce a plane-like surface. To address these deficiencies,
the use of semantic priors has also been considered, e.g., rigid object
priors [?] and non-rigid face priors [?], for 3D reconstruction and depth
recovery.

Figure 4. Adapting identity over time. (a) The common initial base shape.
(b) Appearances of two testers. (c) For the male tester, the identity
parameter $w_{id}$ converges after three frames, compared to the female
tester's four-frame convergence.

Our work is based on [?] but extends it in several significant ways. That
work mainly introduces the idea of using the face prior and focuses on
the depth recovery of one single RGBD image with the help of face
registration. It uses a coarse generic wireframe face model, which can
only provide a limited reliable depth prior. In contrast, we employ our
optimized final blendshape model, which can provide dense prior
information. We also integrate depth recovery with real-time face
tracking, for which we develop a local-filtering-based depth recovery
method for fast processing.

In particular, similar to [?], the recovery of the depth map $X$ is
formulated as the following energy minimization problem:

$$\min_X \; E_r(X) + \lambda_d E_d(X, Z) + \lambda_f E_f(X, V), \qquad (14)$$

where the smoothness term $E_r(X)$ measures the quadratic variations
between neighboring depth pixels, the fidelity term $E_d(X, Z)$ is
adopted to ensure $X$ does not significantly depart from the depth
measurement $Z$, and the face prior term $E_f(X, V)$ utilizes the
blendshape prior $V$ to guide the depth recovery. We define

$$E_r(X) = \frac{1}{2} \sum_i \sum_{j \in \Omega_i} \alpha_{ij} \left( X(i) - X(j) \right)^2 \qquad (15)$$

where $i$ and $j$ are pixel indices, $\Omega_i$ is the set of neighboring
pixels of pixel $i$, and $\alpha_{ij}$ is the normalized joint trilateral
filtering (JTF) weight, which is inversely proportional to the pixel
distance, the color difference, and the depth difference [?]. For the
fidelity term $E_d(X, Z)$, we use the Euclidean distance between $X$ and
$Z$, i.e., $E_d(X, Z) = \frac{1}{2} \|X - Z\|^2$. For simplicity, we use
$V$ to represent the depth map generated by rendering the current 3D
blendshape model at the color camera viewpoint. Then, the face prior term
$E_f(X, V)$ is computed as $E_f(X, V) = \frac{1}{2} \|X - V\|^2$.

A simple recursive solution to (14) is obtained from the vanishing
gradient condition, resulting in

$$X^{(t+1)}(i) = \frac{\lambda_d Z(i) + \lambda_f V(i) + \sum_{j \in \Omega_i} (\alpha_{ij} + \alpha_{ji}) X^{(t)}(j)}{\lambda_d + \lambda_f + \sum_{j \in \Omega_i} (\alpha_{ij} + \alpha_{ji})}, \qquad (16)$$

where the superscript denotes the iteration number. Such a filtering
process is GPU-friendly, and the number of iterations can be explicitly
controlled to achieve a better trade-off between recovery accuracy and
speed.
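
The recursive update can be sketched on a 1D depth profile. The code below implements the fixed-point iteration obtained by setting the gradient of (14) to zero, under simplifying assumptions: uniform weights replace the JTF weights, neighbors are the adjacent samples of a 1D chain, and all names and numbers are hypothetical.

```python
def filter_depth(Z, V, alpha, neighbors, lam_d, lam_f, iters):
    """Jacobi-style iteration of the vanishing-gradient update."""
    X = list(Z)
    for _ in range(iters):
        X_next = []
        for i in range(len(X)):
            num = lam_d * Z[i] + lam_f * V[i]
            den = lam_d + lam_f
            for j in neighbors(i):
                w = alpha(i, j) + alpha(j, i)
                num += w * X[j]
                den += w
            X_next.append(num / den)
        X = X_next
    return X


def chain_neighbors(n):
    """Neighborhood of a 1D signal: the previous and next sample."""
    def neighbors(i):
        return [j for j in (i - 1, i + 1) if 0 <= j < n]
    return neighbors


def uniform_alpha(i, j):
    """Assumption: constant weight instead of the trilateral weight."""
    return 0.5


prior = [1.0, 1.0, 1.0, 1.0, 1.0]   # depth rendered from the face model
noisy = [1.2, 0.8, 1.1, 0.9, 1.3]   # measured depth
recovered = filter_depth(noisy, prior, uniform_alpha, chain_neighbors(5),
                         lam_d=1.0, lam_f=1.0, iters=10)
```

Each output value is a weighted average of the measurement, the prior and the neighboring estimates, so every pixel can be updated independently within an iteration, which is what makes the scheme easy to parallelize.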

6. Experiments 

6.1. Tracking Experiments 

We carried out extensive tracking experiments on synthetic BU4DFE
sequences and real videos captured by a Kinect camera. We compared the
tracking performance of our method to that of the RGB-based trackers DDER
[?], CoR [32] and RLMS [26] in terms of the average root mean square
error (RMSE) in pixel positions of 2D landmarks. We also evaluated the
trackers' robustness by comparing the proportions of unsuccessfully
tracked frames.

6.1.1 Evaluations on Synthetic Data 

The BU4DFE dataset [31] contains sequences of high-resolution 3D dynamic
facial expressions of human subjects. We rendered these sequences into
RGBD to simulate the Kinect camera at three distances: 1.5 m, 1.75 m and
2 m, with added rotation and translation. In total, we collected tracking
results from 270 sequences. The dataset does not provide ground truth, so
we used the RLMS tracker [26], which works well on BU4DFE sequences, to
recover 2D landmarks on the images rendered at 0.6 m, which were then
reprojected to the different distances and treated as ground truth.

The overall evaluation results are shown in Table 1. Our tracker
performed comparably to the state-of-the-art CoR and outperformed the
blendshape-based DDER [?]. CoR did not produce results for the sequences
at 1.75 m and 2 m, where the faces were too small for it to handle.








Table 1. Evaluation results of the proposed method and other face
trackers on the BU4D dataset. RMSE is measured in pixels.

Dataset          DDER [?]   CoR [32]   Ours
BU4D (1.5 m)     2.20       1.05       1.27
BU4D (1.75 m)    1.94       n/a        1.14
BU4D (2.0 m)     1.76       n/a        1.14



Figure 5. A sample from the BU4DFE dataset, rendered at 1.5 m. From left
to right: results by CoR, DDER and our tracker.


6.1.2 Experiments on Real Data 

We compared the tracking performance of our approach to other methods on
11 real sequences at various distances, with different lighting
conditions, complex head movements as well as facial expressions. We used
RLMS to recover the ground truth, and manually labeled the frames that
were incorrectly tracked.

The results are shown in Table 2. For RLMS, we only considered the
performance on frames that had been manually labeled, since its results
were otherwise used as ground truth. Note that RLMS is included mainly as
a reference and the numbers do not reflect its true performance, as only
incorrectly tracked frames were measured. Once again, our method
outperformed DDER and was very close to CoR. The consistent error values
demonstrate that our tracker is stable, particularly under large
rotations or when the face is partially covered, as illustrated in Fig. 6
and Fig. 7.

To better assess the robustness of each tracker, we compared the
percentage of aggregated lost frames from all sequences in Table 3. A
frame was counted as mistracked either when the tracker produced empty
output or when the RMSE was large (RMSE > $\tau$, with $\tau = 10$). We
also did not count sequence luc03 for DDER, nor luc03 and luc04 for CoR,
toward their overall percentages, because the faces were not registered
correctly from the beginning, perhaps largely because the face detector
failed to locate the face correctly. The results show that the 2D+3D
optimization combination of our method provides robust tracking overall.
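
The two evaluation measures can be sketched as follows. The RMSE here is taken over the Euclidean pixel distance of each landmark, which is an assumption about the exact definition; empty tracker output is represented by `None`, and the function names are hypothetical.

```python
import math


def landmark_rmse(pred, truth):
    """Root mean square of per-landmark Euclidean pixel errors."""
    sq = [(px - tx) ** 2 + (py - ty) ** 2
          for (px, py), (tx, ty) in zip(pred, truth)]
    return math.sqrt(sum(sq) / len(sq))


def lost_frame_percentage(per_frame_rmse, tau=10.0):
    """Percentage of frames with empty output (None) or RMSE above tau."""
    lost = sum(1 for r in per_frame_rmse if r is None or r > tau)
    return 100.0 * lost / len(per_frame_rmse)


rmse = landmark_rmse([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
lost = lost_frame_percentage([1.0, None, 12.0, 9.9])
```

Reporting both numbers matters because an average RMSE computed only over successfully tracked frames hides how often a tracker fails outright.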

6.1.3 Running Time 

Our tracker is implemented in native C++ and parallelized with TBB², with
the GPU only used for calculating the 47 base expression blendshapes in
(2). Running time was measured on a 3.4 GHz Core i7 CPU machine with a
GeForce GT640 GPU. Shape regression ran in 5 ms and refining $(R, T, e)$
took 12 ms, with auxiliary processing taking another 10 ms. Overall,
without identity adaptation, the tracker ran at 30 Hz. The bottleneck is
in optimizing for $w_{id}$, which took 14 ms, while calculating the 47
base blendshapes took 80 ms on the GT640 GPU with 384 CUDA cores. This
process is only carried out at initialization or during tracker restarts.
The use of modern GPU cards with higher CUDA core counts should remove
this bottleneck.

² https://www.threadingbuildingblocks.org/

Table 2. Evaluation results of the proposed method and other face
trackers on real videos. RMSE is measured in pixels.

Dataset       DDER [?]   CoR [32]   RLMS [26]   Ours
dt01          9.65       4.15       6.04        4.51
ar00          3.41       66.72      7.41        2.36
dt00          3.57       1.65       4.63        2.29
my01          5.61       2.79       4.35        2.89
fw01          6.5        3.27       36.11       4.85
fw02          5.34       1.80       2.56        3.50
luc01         4.96       2.38       5.86        3.49
luc02         3.95       1.51       2.04        3.02
luc03 (2m)    37.17      n/a        1.67        1.77
luc04 (2m)    2.63       62.45      n/a         1.84
luc05         3.39       2.39       3.44        2.88

Table 3. The overall percentage of lost frames during tracking from all
real videos.

DDER [?]   CoR [32]   RLMS [26]   Ours
2.21%      7.22%      3.61%       0.74%


6.2. Depth Recovery Experiments 
6.2.1 Synthetic Data 


We used the same set of BU4DFE sequences as in Section 6.1.1 at 1.75 m
and 2 m. Instead of evaluating the tracking accuracy, we measured the
surface reconstruction error with respect to the 3D synthetic surface
used for generating the data. To simulate different depth ranges of the
target, we increased the noise level of the input depth map according to
[22]. We ran the tracker on these sequences and collected the surface of
the blendshape (BS Surface) as well as the enhanced depth map, which was
filtered using face priors (DRwFP). We compared these two surfaces to the
ground truth surface. Additionally, we compared our method to the depth
recovery method in [9] using the Mean Absolute Error (MAE) in mm,
$\mathrm{MAE} = \frac{1}{|\Omega|} \sum_{i \in \Omega} |d_i - g_i|$,
where $\Omega$ is the set of valid depth pixels, while $d_i$ and $g_i$
are the recovered and ground truth depth values, respectively. The
results are summarized in Table 4.
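
As a small illustration of the error measure, the sketch below computes the mean absolute error over a set of valid pixel indices. Depth maps are flattened into lists and the set of valid pixels is passed explicitly; the function name and the sample values are hypothetical.

```python
def mean_absolute_error(recovered, ground_truth, valid):
    """Average |d_i - g_i| over the valid pixel indices only."""
    errors = [abs(recovered[i] - ground_truth[i]) for i in valid]
    return sum(errors) / len(errors)


recovered_mm = [700.0, 710.0, 0.0]      # last pixel has no depth reading
ground_truth_mm = [702.0, 707.0, 999.0]
mae = mean_absolute_error(recovered_mm, ground_truth_mm, valid=[0, 1])
```

Restricting the average to valid pixels prevents missing depth readings from dominating the error.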

The results show that the high noise levels, often higher than that of
the actual Kinect depth data, led to large errors in blendshape modeling.
However, the face guidance filter mitigated these problems and recovered
depth maps that were closer to the ground truth surface. The improvements
over [9] range from 15% at 1.75 m to 20% at 2 m.

Figure 6. Each group of four shows the results of four trackers on the
same frame. From left to right: RLMS, CoR, DDER and our method. Our
tracker and RLMS can handle occlusion by hair. In general, our tracker is
robust to large rotations and it models realistic facial deformations.

Figure 7. Example showing that the proposed tracker can handle partial
occlusion of the face. The first row shows the resulting projected
landmarks and the head orientation as 3 axes (the red, green and yellow
axes are yaw, pitch and roll, respectively). The second row shows the 3D
view of the blendshape model (in red) and the input point cloud (in
white) of each corresponding frame. Except for (c), where the frontal
view is shown, (a, b, d, e, f) show the side view. In each frame, the
occlusion on the point cloud is circled in yellow. Tracking performance
is not measured for this video and it is not included in Table 2, because
we recorded this sequence after we had finished all the benchmarks.






















Table 4. Average MAE in mm of depth reconstruction on the BU4DFE dataset.

Dataset             [9]     BS Surface   DRwFP
BU4DFE (1.75 m)     2.83    9.16         2.39
BU4DFE (2.0 m)      3.59    8.79         2.85


6.2.2 Real Data 

As we do not have ground truth for real data, in this section we only
provide visual results of the recovered depth map at 2 m. Fig. 8 shows
depth recovery results on two sample depth frames. It is difficult to
recognize any facial characteristics in the raw depth maps. The filter in
[9] smoothed out the depth maps but was not able to recover any facial
details. In contrast, our depth filter with face priors was able to
reconstruct the facial shapes with recognizable quality.



Figure 8. Depth recovery on real depth maps. (a) The blendshape priors.
(b) The raw depth maps. (c) The depth maps refined by [9]. (d) The depth
maps refined with face priors.


7. Conclusion 

We presented a novel approach to RGBD face tracking, using 3D facial
blendshapes to simultaneously model the head movements as well as facial
expressions. The tracker is driven by a fast shape regressor, which
allows the tracker to perform consistently across a wide range of
distances, beyond the working range of current state-of-the-art RGBD face
trackers. This 3D shape regressor directly estimates shape parameters
together with 2D landmarks in the input color frame. The shape parameters
are refined further by optimizing a well-designed 2D+3D energy function.
Using this framework, our tracker can automatically adapt the 3D
blendshapes to better fit the individual facial characteristics of
tracked humans. In extensive experiments on synthetic and real RGBD
videos, our tracker performed consistently well in complex conditions and
at different distances.


With the ability to model articulated facial expressions and complex head
movements, our tracker can be deployed in various tasks such as animation
and virtual reality. In addition, we use the blendshape as a prior in a
novel depth filter to better reconstruct the depth map, even at larger
distances. The refined depth map can later be used together with the
blendshape to reproduce the facial shape regardless of object-camera
distances.